ABSTRACT



A computing system (50) includes N number of symmetrical computing engines having N number of cache memories joined by a system bus (12). The computing system includes a global run queue (54), an FPA global run queue, and N number of affinity run queues (58). Each engine is associated with one affinity run queue, which includes multiple slots. When a process first becomes runnable, it is typically attached to one of the global run queues. A scheduler allocates engines to processes and schedules the processes to run on the basis of priority and engine availability. An engine typically stops running a process before it is complete. When the process becomes runnable again, the scheduler estimates the remaining cache context for the process in the cache of the engine. The scheduler uses the estimated amount of cache context in deciding in which run queue a process is to be enqueued. The process is enqueued to the affinity run queue of the engine when the estimated cache context of the process is sufficiently high, and is enqueued onto the global run queue when the cache context is sufficiently low. The procedure increases computing system performance and reduces bus traffic because processes will run on engines having sufficient cache affinity, but will also run on the best available engine when there is insufficient cache context.

25 Claims, 8 Drawing Sheets 
CACHE AFFINITY SCHEDULER

TECHNICAL FIELD

This invention relates to scheduling of processes that run on computing engines in a multiprocessor computing system and, in particular, to a scheduler that considers erosion of cache memory context when deciding in which run queue to enqueue a process.

BACKGROUND OF THE INVENTION

FIG. 1 is a block diagram of major subsystems of a prior art symmetrical multi-processor computing system 10. Examples of system 10 are models S27 and S81 of the Symmetry Series manufactured by Sequent Computing Systems, Inc., of Beaverton, Oregon, the assignee of the present patent application. The Symmetry Series models employ a UNIX operating system with software written in the C programming language. The UNIX operating system, which is well known to those skilled in the art, is discussed in M. Bach, The Design of the UNIX Operating System, Prentice Hall, 1986. The C programming language, which is also well known to those skilled in the art, is described in B. Kernighan and D. Ritchie, The C Programming Language, 2d Ed., Prentice Hall, 1988.

Referring to FIG. 1, system 10 includes N number of computing engines denominated engine 1P, engine 2P, engine 3P, . . . , engine NP (collectively "engines 1P-NP"). Each one of engines 1P-NP is a hardware computer which includes a microprocessor such as an Intel 386 and associated circuitry. System 10 is called a symmetrical multi-processor computing system because each one of engines 1P-NP has equal control over system 10 as a whole.

Each one of engines 1P-NP has a local cache memory denominated cache 1P, cache 2P, cache 3P, . . . , cache NP, respectively (collectively "caches 1P-NP"). The purpose of a cache memory is to provide high-speed access and storage of data associated with processes performed by an engine. A system bus 12P joins engines 1P-NP to a main RAM memory 14P of system 10. Data stored in one of cache memories 1P-NP can originate from the corresponding engine, the cache memory of another engine, a main memory 14P, a hard disk controlled by a disk controller 18, or an external source such as terminals through a communications controller 20.

Cache memories 1P-NP are each organized as a pseudo least-recently-used (LRU) set associative memory. As new data are stored in one of the cache memories, previously stored data are pushed down the cache memory until the data are pushed out of the cache memory and lost. Of course, the data can be copied to main memory 14 or another cache memory before the data are pushed out of the cache memory. The cache context of a process with respect to an engine "erodes" as data associated with a process is pushed out of the cache memory of the engine.

A scheduler determines which engine will run a process, with a highest priority process running first. On a multi-processor system, the concept of priority is extended to run the highest n number of priority processes on the m number of engines available, where m = n unless some of the engines are idle. In system 10, the scheduler is a software function carried on by the UNIX operating system that allocates engines to processes and schedules them to run on the basis of priority and engine availability. The scheduler uses three distinct data structures to schedule processes: (1) a global run queue 34 (FIG. 2), (2) a floating point accelerator (FPA) global run queue, and (3) an engine affinity run queue 38 (FIG. 3) for each engine. The FPA global run queue is the same as global run queue 34 except that the FPA global run queue may queue processes requiring FPA hardware.

FIG. 2 illustrates the prior art global run queue 34 used by system 10. Referring to FIG. 2, global run queue 34 includes an array qs and a bit mask whichqs. Array qs is comprised of 32 pairs (slots) of 32-bit words, each of which points to one linked list. The organization of the processes is defined by data structures. Each slot includes a ph_link field and a ph_rlink field, which contain address pointers to the first address of the first process and the last address of the last process, respectively, in a double circularly linked list of queued processes. The 32 slots are arranged in priority from 0 to 31, as listed at the left side of FIG. 2, with priority 0 being the highest priority.

The bit mask whichqs indicates which slots in the array qs contain processes. When a slot in qs contains a process, whichqs for that slot contains a "1". Otherwise, the whichqs for that slot contains a "0". When an engine looks for a process to run, the engine finds the highest priority whichqs bit that contains a 1, and dequeues (i.e., detaches) a process from the corresponding slot in qs.

FIG. 3 illustrates the prior art affinity run queue 38 used by system 10. Under default conditions, after a process becomes runnable, it is enqueued to either global run queue 34 or the FPA global run queue, rather than to an affinity run queue 38. However, a process may be "hard affinitied" to run only on a particular engine. In that case, the process is enqueued only to the affinity run queue 38 of that engine (rather than a global run queue) until the hard affinity condition is ended. Of course, there are times, for example, when it is asleep, when a hard affinitied process is not queued to any run queue.

Referring to FIG. 3, each one of engines 1P-NP is associated with its own affinity run queue 38. Each affinity run queue 38 has an e_head field and an e_tail field, which contain the address pointers to the first and last address, respectively, of the first and last processes in a doubly circularly linked list of affinitied processes. When a process is hard affinitied to an engine, it is enqueued in FIFO manner to the double circularly linked list of the affinity run queue 38 that corresponds to the engine. The FIFO arrangement of each linked list is illustrated in FIG. 3. Each of the linked lists of the slots of global run queue 34, shown in FIG. 2, has the same arrangement as the linked list shown in FIG. 3.

Affinity run queue 38 differs from global run queue 34 in the following respects. First, global run queue 34 has 32 slots and can, therefore, accommodate 32 linked lists. By contrast, each affinity run queue 38 has only one slot and one linked list. Second, as a consequence of each affinity run queue 38 having only one linked list, the particular engine corresponding to the affinity run queue 38 is limited to taking only the process at the head of the linked list, even though there may be processes having higher priority in the interior of the linked list. Third, the linked list of processes must be emptied before the engine can look to a global run queue for additional processes to run. Accordingly, affinity run queue
38 does not have a priority structure and runs processes in round robin fashion. Fourth, as noted above, only hard affinitied processes are enqueued to affinity queue 38.

Because of the non-dynamic and explicitly-invoked nature of hard affinity, it is used mostly for performance analysis and construction of dedicated system configurations. Hard affinity is not used in many customer configurations because the inflexibility of hard affinity does not map well to the complexity of many real world applications.

The lifetime of a process can be divided into several states including: (1) the process is "runnable" (i.e., the process is not running, but is ready to run after the scheduler chooses the process), (2) the process is executing (i.e., running) on an engine, and (3) the process is sleeping. The second and third states are well known to those skilled in the art and will not be described herein in detail.

When a process is runnable, the scheduler first checks to see whether the process is hard affinitied to an engine. If the process is hard affinitied, it is enqueued onto the FIFO linked list of the affinity run queue 38 associated with the engine.

If the process is not hard affinitied to an engine and is not marked for FPA hardware, then the process is queued on one of the linked lists of qs in global run queue 34 according to the priority of the process. The appropriate bit in whichqs is updated, if necessary. If the process is not affinitied, but has marked itself as requiring FPA hardware, the process is queued to one of the linked lists of fpa_qs in the FPA global run queue, and the appropriate bit in fpa_whichqs is updated, if necessary, where fpa_qs and fpa_whichqs are analogous to qs and whichqs.

When an engine is looking for a process to run, the engine first examines its affinity run queue 38. If affinity run queue 38 contains a process, the first process of the linked list is dequeued and run. If affinity run queue 38 for a particular engine is empty, then the scheduler examines whichqs of global run queue 34 and fpa_whichqs of the global FPA run queue to see whether processes are queued and at what priorities. The process having the higher priority runs first. If the highest priorities in whichqs of global run queue 34 and fpa_whichqs in the FPA global run queue are equal, the process in the FPA format runs first.

A goal of system 10 is to achieve a linearly increasing level of "performance" (i.e., information processing capacity per unit of time), as engines and disk drives are added. An obstacle to meeting that goal occurs when there is insufficient bus bandwidth (bytes per unit time period) to allow data transfers to freely flow between subsystem elements. One solution would be to increase the bandwidth in system bus 12P. However, the bandwidth of system bus 12P is constrained by physical cabinet and connector specifications.

The problem of inadequate bandwidth is exacerbated because system 10 allows customers to add additional disk drives and engines to increase the value of the number NP after the system is in the field. In addition, in system 10, engines 1P-NP and the disk drives may be replaced with higher performance engines and disk drives. Increased engine performance increases the number of instructions processed per operating system time slice. This in turn requires larger cache memories on the processor boards in an effort to reduce memory-to-processor traffic on the main system bus. However, cache-to-cache bus traffic increases along with the cache memory size thereby frustrating that effort. Likewise, adding multiple disk drives increases the requirement for disk I/O bandwidth and capacity on the bus.

When a process moves from a previous engine to a new engine, there is some cost associated with the transition. Streams of cache data move from one engine to another, and some data are copied from main memory 14. Certain traffic loads, database traffic loads in particular, may result in a bus saturation that degrades overall system performance. Data are transferred over system bus 12P as data are switched from main memory 14P or the previous cache memory to the new engine and the new cache memory. However, each time a process runs from the global run queue, the odds that the process will run on the same engine as before approaches 1/m, where m is the number of active on-line engines. On a large system, m is usually 20 or more, giving less than a 5% chance that the process will run on the same engine as before. However, it is difficult to accurately characterize the behavior of an operating system. The actual odds will, of course, depend on CPU and I/O load and the characteristics of the jobs running.

In many situations, it is desirable for an engine to stop running an unfinished process and perform another task. For example, while an engine is running one process, a higher priority process may become runnable. The scheduler accommodates this situation through a technique called "nudging." A nudge is a processor-to-processor interrupt that causes a destination processor to re-examine its condition and react accordingly. In the case of a higher priority process, the "nudged" destination engine will receive the interrupt, re-enter the operating system, notice that there is higher priority work waiting, and switch to that work. Each nudge has a corresponding priority value indicating the priority of the event to which the engine responds. As an optimization, the priority of the nudge pending against an engine is recorded per engine. When nudge is called for priority less than or equal to the value already pending on the engine, the redundant nudge is suppressed.

When a process becomes runnable, the scheduler scans engines 1P-NP for the engine(s) running the lowest priority process(es). If the newly runnable process has an equal or greater priority than the presently running processes, the engine (or one of the engines) with the highest priority (e.g., engine 1P) is "nudged" to reschedule in favor of the newly runnable process. Consequently, a process (e.g., process X) ceases to run on engine 1P, at least temporarily. During the time other processes are running on engine 1P, the cache context for process X erodes.

Two problems with the prior art scheduler are illustrated by considering what happens when process X (in the example above) becomes runnable. First, if process X is not hard affinitied, it is enqueued to a global run queue. However, as noted above, there is only approximately a 1/m chance that process X will next run on engine 1P. Therefore, even though the cache context for process X may be very high in cache 1P of engine 1P, process X will probably be run on another engine. If process X is run on another engine, some of the capability of system 10 will be used in moving data over system bus 12P. As data is moved over system bus 12P to the other engine, system performance may be reduced. Second, if process X is hard affinitied, it will be enqueued onto affinity run queue 38 of engine 1P, regardless of how many other processes are enqueued onto
affinity run queue 38 of engine 1P and regardless of whether other engines are idle. Therefore, system performance may be reduced because of idle engines.

Thus, the prior art scheduler poorly reuses cache memory unless the flexibility of symmetrical multiprocessing is given up by using hard affinity. Therefore, there is a need for a scheduler that causes a runnable process to be enqueued onto the affinity run queue of an engine when the cache context (or warmth) of the process with respect to the engine is sufficiently high, and to be enqueued onto a global run queue when the cache context is sufficiently low. Additionally, periodic CPU load balancing calculations (schedcpu()) could be improved to maintain a longer-term view of engine load and to cause redistribution of processes if a significant excess of processes exists at any particular engine. Further, such redistribution of processes could consider the priority of the processes to be moved.

SUMMARY OF THE INVENTION

An object of the present invention is to provide a linear performance increase as engines having higher performance microprocessors and larger cache memories are added to a system.

Another object of the invention is to provide a scheduler that enqueues processes to either an affinity run queue or a global run queue so as to, on average, maximize performance of a multi-processor computing system.

A further object of the invention is to optimize the assignment of processes to engines in an attempt to minimize engine-to-engine movement of processes and their related cache memory context.

Still another object of the invention is to increase the effective bus bandwidth in a multi-processor computing system by reducing the amount of unnecessary bus traffic.

This invention satisfies the above objects by implementing data structures and algorithms that improve the process allocating efficiency of the process scheduler by reducing cache-to-cache traffic on the bus and thereby improve the overall computing system performance. The scheduler uses the concept of cache context whereby a runnable process is enqueued onto the affinity run queue of an engine when the estimated cache context of the process with respect to the engine is sufficiently high, and is enqueued onto a global run queue when the estimated cache context is sufficiently low. A premise underlying the operation of the present invention is that it is often more efficient to wait for a busy engine where cache context already exists than to move to a waiting engine where cache context will need to be transferred or rebuilt.

Additional objects and advantages of this invention will be apparent from the following detailed description of a preferred embodiment thereof which proceeds with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of the major subsystems of a prior art symmetrical multi-processor computing system.

FIG. 2 is a schematic diagram of a prior art global run queue.

FIG. 3 is a schematic diagram of a prior art engine affinity run queue.

FIG. 4 is a block diagram of the major subsystems of a symmetrical multi-processor computing system operating in accordance with the present invention.

FIG. 5A is a schematic diagram of a global run queue according to the present invention.

FIG. 5B is a simplified version of the diagram of FIG. 5A.

FIG. 6 is a schematic diagram of an affinity run queue according to the present invention.

FIG. 7 illustrates relationships among data structures according to the present invention.

FIG. 8 illustrates relationships among data structures of the present invention from the perspective of the process data structure.

FIG. 9 illustrates relationships among data structures of the present invention from the perspective of the engine data structure.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENT

Hardware

A preferred embodiment of the present invention is implemented in a symmetrical multi-processor computing system 50 shown in FIG. 4. Referring to FIG. 4, system 50 includes N number of computing engines denominated engine 1, engine 2, engine 3, . . . , engine N (collectively "engines 1-N"). Each one of engines 1-N has a local cache memory denominated cache 1, cache 2, cache 3, . . . , cache N, respectively (collectively "caches 1-N"), each of which is organized as a pseudo LRU FIFO, described above. The hardware of system 50 may be identical to the hardware of prior art system 10, with the only changes being software additions and modifications. Alternatively, in addition to the changes in software, one or more of engines 1-N may be different from the engines of engines 1P-NP. Caches 1-N and main memory 14 may have different capacities from those of caches 1P-NP and main memory 14P, respectively. System bus 12 may have a different number of conductors from those of system bus 12P. Additional potential modifications and additions to system 10 are described below.

Overview of data structures

In prior art system 10, each run queue a process could be scheduled from had a discrete data structure. Global run queue 34 included whichqs and qs. FPA global run queue included whichqs_fpa and qs_fpa. Each engine's affinity run queue included e_head and e_tail. In the preferred embodiment, a single data structure, struct runq, is used instead. Every run queue is treated as the same type of data structure, so the preferred embodiment uses the same code for each run queue, changing only which instance of struct runq it is operating upon.

As used herein, "cache context" is a measure of how much of the data associated with a process is in a cache memory. When the data is initially copied from main memory 14 to a cache memory, the cache context of the cache memory with respect to the process is high (in fact, 100%). The cache context of the cache memory with respect to the process decreases or "erodes" as data in the cache are pushed out of the cache memory as data not associated with the process are added to the cache memory. A cache memory is "warm" if the estimated amount of cache context is above a certain level. A cache memory is "cold" if the estimated amount of cache context is below a certain level. A process has "cache affinity" with respect to an engine if a pointer
(*p_rq) of the process points to the affinity run queue, described below, of the engine.

System 50 includes at least one global run queue, such as global run queue 54 shown in FIGS. 5A and 5B, and an affinity run queue 58 shown in FIG. 6 for each engine 1-N. Global run queue 54 and affinity run queue 58 are examples of the "struct runq" data structure, described below. FIG. 5B is a simplified version of FIG. 5A. The structures of global run queue 54 and affinity run queue 58 are similar to the structure of global run queue 34 in that they may queue processes in different linked lists, according to priority. Each linked list in the multiple slots of global run queue 54 and affinity run queue 58 has the same structure as that of the linked list in the single slot of prior art affinity run queue 38, shown in FIG. 3.

Although each run queue has the same data structure as that of global run queue 54, particular types of processes may be queued to only a certain type of run queue. For example, a first group of engines 1-N could contain Intel 386 microprocessors, a second group of engines 1-N could include Intel 486 microprocessors, and a third group of engines 1-N could contain FPA-equipped Intel 386 microprocessors. In this example, each engine would point to its own affinity run queue 58. In addition, the first group of engines 1-N would point to global run queue 54. The second group of engines 1-N would point to a 486 type global run queue. The third group of engines 1-N would point to an FPA global queue. System 50 could include additional types of global run queues including global run queues for engines having an expanded instruction set, a reduced instruction set, on-chip cache, or other properties. In the case of an engine with a microprocessor with on-chip cache, the present invention preferably would optimize the use of both on-chip and on-board cache.

In some circumstances, a process may be enqueued at different times to more than one type of global run queue. For example, a process that is compatible with the Intel 386 format may be able to run on both an Intel 386 and an FPA engine. In that case, depending on the demands on the different types of engines, the process could be enqueued to either global run queue 54 or the FPA global run queue.

The scheduler according to the present invention comprises the conventional UNIX scheduler to which is added the new and changed data structures and algorithms described herein.

When a process becomes runnable, the scheduler selects the run queue to which the process is to be enqueued. System 50 may use any conventional means to enqueue a process to a run queue. The scheduler selects the run queue by considering an estimation of how much cache context (if any) the process has with respect to the cache of an engine. The estimation of cache context with respect to a process is based on the number of user (as opposed to kernel) process clock ticks an engine has executed since the last time it ran the process.

The process clock ticks are produced by a clock routine, named hardclock. The hardclock routine is entered for each engine one hundred times a second. If the routine is entered during a time when a particular engine is running a user process, then a 32-bit counter associated with the particular engine is incremented. Engines 1-N include counters 1-N, respectively, shown in FIG. 4. For example, counter 2 is included with engine 2. The functions of hardclock and counters 1-N may be performed either by existing hardware in prior art system 10 with additional software, or by additional hardware and software.

As each process leaves an engine, a pointer (*p_runeng) that points to the engine is stored. The value (p_runtime) of the counter of the engine is also stored. For example, at time t1, process X leaves engine 2. Accordingly, *p_runeng = engine 2 and the count (p_runtime) of counter 2 at time t1 are stored. At time t2, the process becomes runnable again. At time t3, the scheduler calculates the difference (D(e-p)) between the stored p_runtime and the count (e_runtime) of counter 2 at time t3. D(e-p) is inversely related to the probable amount of cache context remaining. The scheduler uses D(e-p) in estimating how much cache context remains for the process with respect to the engine.

An engine is pointed to by and can run processes from two or more run queues. In the examples described herein, system 50 employs three types of run queues, each having the identical data structure: a global UNIX run queue 54, an FPA global run queue, and one affinity run queue 58 for each engine. When an engine is activated, it is pointed to by a list of the appropriate run queues. Processes are generally initially attached to the global run queue, and may be moved to other run queues as described below.

Conceptually, a process does not choose to run on a particular engine; instead, a process is enqueued to a run queue, and then is run by an engine that is a member of an engine list pointed to by the run queue. If the engine list contains only a single engine, the run queue is an affinity run queue and the process is affinitied to the engine. If the engine list contains only FPA-equipped engines, the run queue is an FPA run queue and the process has FPA affinity. A global run queue points at an engine list that contains all of the engines on the system that will run processes of the type queued to the global run queue.

Each process has a data structure that has two pointers to run queues: a current run queue pointer (*p_rq) and a home run queue pointer (*p_rqhome). The current run queue pointer indicates the run queue to which the process is enqueued when it becomes runnable. The home run queue pointer indicates the home run queue of the process. When a process is first created, both pointers generally point to a global run queue. (The home run queue may be an affinity run queue if the process is hard affinitied to an affinity run queue.) However, when a process is moved to an affinity run queue (because of sufficient cache context), the current run queue pointer is moved, but the home pointer is unchanged. If the process needs to move from its current run queue (because of insufficient cache context or because the engine is shutting down), the current run queue pointer of the process is moved back so that the current run queue pointer points to the home run queue. For an FPA process, the home run queue pointer points to an FPA global run queue. This allows the scheduler to apply cache affinity to more than one global scheduling pool of processes, without the need for consideration of special cases.

The following sequence of operations is exemplary of the handling of processes by the cache affinity scheduler of the present invention. When a process is first created, it is attached to its current run queue which is generally the global run queue. If the process has a higher priority than that of the process currently running in any engine that is a member of the engine list for the current run queue, the engine is nudged to decide
whether to immediately run the newly enqueued process in place of the running process.

In addition, an engine can become idle and look for a 
runnable process among any of the run queues that are 
on the list of run queues for that engine. The engine then 
selects the highest priority process found, and runs it. 
Sometime later the process will cease running, at which 
time the counter value (p_runtime) for the process is
stored. Often when a process stops running it enters a 
sleeping state in which the process waits for another 
action such as a key stroke prior to becoming runnable 
again. 
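The idle-engine selection described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the scheduler's actual code: the field names r_whichqs and r_qs, the helper names runq_enqueue, runq_best_pri and engine_pick, the simplified struct proc, the singly linked slot lists, and the mapping of bit i of the mask to slot i are all hypothetical. It also assumes that, when two run queues hold processes of equal priority, the queue listed first for the engine wins.

```c
#include <stddef.h>
#include <stdint.h>

#define NQS 32                      /* slots per run queue; priority 0 is highest */

/* Simplified stand-in for the process structure (hypothetical fields). */
struct proc {
    int          p_pri;             /* priority, assumed to be 0..NQS-1 */
    struct proc *p_link;            /* next process in the same slot */
};

/* Simplified run queue: a bit mask plus one FIFO list per priority slot. */
struct runq {
    uint32_t     r_whichqs;         /* bit i set => slot i is non-empty */
    struct proc *r_qs[NQS];         /* head of each slot's list */
};

/* Append a process to the tail of its priority slot and mark the slot. */
void runq_enqueue(struct runq *rq, struct proc *p)
{
    struct proc **pp = &rq->r_qs[p->p_pri];
    while (*pp != NULL)
        pp = &(*pp)->p_link;
    p->p_link = NULL;
    *pp = p;
    rq->r_whichqs |= UINT32_C(1) << p->p_pri;
}

/* Return the best (lowest-numbered) non-empty slot, or -1 if the queue is empty. */
static int runq_best_pri(const struct runq *rq)
{
    for (int i = 0; i < NQS; i++)
        if (rq->r_whichqs & (UINT32_C(1) << i))
            return i;
    return -1;
}

/*
 * Scan every run queue on the engine's list and detach the highest priority
 * process found. Returns NULL when all queues are empty.
 */
struct proc *engine_pick(struct runq **rql, size_t nrq)
{
    struct runq *best = NULL;
    int best_pri = NQS;

    for (size_t k = 0; k < nrq; k++) {
        int pri = runq_best_pri(rql[k]);
        if (pri >= 0 && pri < best_pri) {   /* strict '<': earlier queue wins ties */
            best_pri = pri;
            best = rql[k];
        }
    }
    if (best == NULL)
        return NULL;

    struct proc *p = best->r_qs[best_pri];
    best->r_qs[best_pri] = p->p_link;
    if (best->r_qs[best_pri] == NULL)       /* slot drained: clear its mask bit */
        best->r_whichqs &= ~(UINT32_C(1) << best_pri);
    p->p_link = NULL;
    return p;
}
```

An engine whose list holds its affinity run queue followed by a global run queue would call engine_pick with that two-element list each time it becomes idle.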

When the process subsequently becomes runnable, 
cache context becomes a consideration in deciding to 
which run queue the process should be enqueued. The 
scheduler first examines the p_flag field of the process
to determine whether the cache affinity bit (SAFFIN) is 
set (i.e., = 1). If the cache affinity bit is set, the scheduler 
considers whether the process has cache affinity in 
deciding in which run queue the process will be enqueued.

The scheduler next inspects the engine on which the process last ran to determine the current counter value for that engine (e_runtime). D(e-p) = e_runtime - p_runtime is the accumulated clock tick value for other user processes that have run or are running on the engine since the process last ran there. If D(e-p) is low (e.g., less than 3), the estimated cache context is high. Accordingly, on average, the performance of system 50 will be increased by enqueuing the process to the affinity run queue rather than to a global run queue. In that case, the process switches its current run queue pointer from the process' home run queue to the affinity run queue for that engine. If D(e-p) is high (e.g., more than 15), the estimated cache context is low and the process is moved back to its home run queue.

Data structure descriptions

The cache affinity scheduler employs a set of data structures and algorithms that handle processes according to the general description given above. Some of the data structures are new, and others are existing data structures for which additional fields have been defined. For existing data structures, the added fields are described. Fields not disturbed in the conventional UNIX data structures are indicated by ". . ." in the listings. For new data structures, the entire declaration is included.

1. Added fields in the process data structure

    struct proc {
        . . .
        struct runq   *p_rqhome;    /* home run queue of process */
        struct runq   *p_rq;        /* current run queue */
        ulong          p_runtime;   /* eng time when last run */
        struct engine *p_runeng;    /* eng the process last ran on */
    #define SAFFIN 0x4000000        /* use affinity in scheduling */
        . . .
    };

*p_rqhome and *p_rq are the home and current run queue pointers. As described below, *p_rq points to the run queue to which the process will enqueue itself. Therefore, if the process chooses to enqueue itself to a different run queue (because of cache affinity considerations or an explicit hard affinity request), the *p_rq field is updated accordingly. *p_rqhome keeps track of the base or home run queue of the process. Absent hard affinity, *p_rqhome points to a particular global run queue, depending on the type of process. For example, if the process is an FPA type process, *p_rqhome for the process points to the FPA global run queue.

p_runtime holds the engine hardclock counter value at the time this process last ran.

*p_runeng is a pointer to the engine this process last ran on.

SAFFIN (cache affinity bit) is a new bit in the existing p_flag field. Cache context will be considered only for processes having this bit set. This bit is inherited through fork(), and all processes initiated from initial startup of the system will have the cache affinity bit set. The bit is cleared when a process hard affinities itself to an engine, as cache affinity is then a moot factor.

2. Added fields in the engine data structure

    struct engine {
        . . .
        int            e_npri;      /* priority of current process */
        struct runql  *e_rql;       /* list of run queues to run from */
        struct runq   *e_rq;        /* affinity run queue of engine */
        ulong          e_runtime;   /* clock for cache warmth calc */
        struct runq   *e_pushto;    /* load balancing, where to push */
        int            e_pushcnt;   /* # processes to push there */
        . . .
    };



50 



55 



The following is an explanation of the added process ^ 
data structure fields: 



The following is an explanation of the added fields in 
the engine data structure: 

e_npri is a field used in the engine data structure of 
prior art system 10. Although e_npri is not added by 
the present invention, e_npri is included here because it 
is discussed below with respect to the timeslice 
algorithm. e_npri records the priority of a process the 
engine may be currently running. The scheduler uses 
e_npri to correct for ties in priority of processes so that 
processes from certain run queues are not continuously 
ignored. 

e_rql maintains the linked list of the run queues from 
which this engine schedules. 

e_rq indicates the affinity run queue for this engine 
(i.e., a run queue whose only member is this engine). 

e_runtime is the engine counter value that is compared 
with p_runtime to calculate the amount of cache 
context (cache warmth) remaining for this engine. 

e_pushto identifies the engine to which processes 
will be moved if process load balancing is required. The 
algorithms controlling load balancing are described 
below. 

e_pushcnt identifies the number of processes that 
will be moved to a different engine if load balancing is 
used. 
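
The relationship between e_runtime and p_runtime can be illustrated with a minimal C sketch. It assumes, as the description suggests, that the engine counter advances only on clock ticks charged to user processes. The functions engine_tick, record_last_run, and cache_age are hypothetical names introduced for illustration, and the structures carry only the fields the sketch needs.

```c
#include <assert.h>

/* Only the fields this sketch needs; names follow the listings above. */
struct engine {
    unsigned long e_runtime;    /* clock for cache warmth calc */
};

struct proc {
    unsigned long p_runtime;    /* eng time when last run */
    struct engine *p_runeng;    /* eng the process last ran on */
};

/* Hypothetical hook: called once per clock tick on engine e.
 * running_user is nonzero when the tick interrupted a user process. */
void engine_tick(struct engine *e, int running_user)
{
    if (running_user)
        e->e_runtime++;
}

/* Hypothetical hook: called when engine e stops running process p. */
void record_last_run(struct proc *p, struct engine *e)
{
    p->p_runtime = e->e_runtime;
    p->p_runeng = e;
}

/* Ticks charged to the engine since p last ran there (D_{e-p}).
 * Unsigned subtraction stays correct if the counter wraps around once. */
unsigned long cache_age(const struct proc *p)
{
    assert(p->p_runeng != 0);   /* the process must have run before */
    return p->p_runeng->e_runtime - p->p_runtime;
}
```

A small result from cache_age suggests that little other work has displaced the process's data from the engine's cache; a large result suggests the cache is cold for that process.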



3. Run queue data structure 

/* a place to enqueue a process for running */ 
struct runq { 
    int r_whichqs;             /* bit mask of runq priority levels with waiting processes */ 
    struct prochd r_qs[NQS];   /* run queues, one per bit in r_whichqs */ 
    int r_pmembers;            /* # processes belonging to this queue */ 
    int r_emembers;            /* # engs scheduling from this queue */ 
    struct engl *r_engs;       /* a list of those engines */ 
    unsigned r_flags;          /* miscellaneous flags */ 
    struct runq *r_act;        /* a pointer to next active runq */ 
}; 



The following is an explanation of the run queue data 
structure fields: 

r_whichqs and r_qs correspond to the structures 
whichqs and qs used in prior art global run queue 34. 
prochd is the pair of ph_link and ph_rlink (which are 
described in connection with FIG. 2) for each slot in a 
run queue according to the present invention. Each bit 
in r_whichqs corresponds to an index in r_qs[NQS]; 
the bit is set if there is a process queued in that slot. 
"NQS" means the "number of queue slots," for example, 
32 slots. 

r_pmembers and r_emembers count the number of 
processes and engines, respectively, belonging to this 
run queue. 

*r_engs is a pointer to a linked list of all engines 
scheduling from this run queue. 

r_flags holds miscellaneous flags. 

*r_act is a pointer to a singly-linked list of all run 
queues active on the system. It is used primarily by 
timeslice() to implement timeslicing priority among 
processes of equal priority on the system. Timeslicing is 
discussed below. 
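
The r_whichqs bit mask lets the scheduler find a non-empty slot without walking every list in r_qs. The following C sketch illustrates the idea under two assumptions that the description does not state: bit i of the mask stands for r_qs[i], and lower slot indices are served first. The helper names are hypothetical, and the mask is held in an unsigned int (the listing declares it as int) so that shifting into the top bit is well defined.

```c
#include <assert.h>

#define NQS 32   /* number of queue slots, as in the example in the text */

/* Mark a slot as holding at least one queued process. */
unsigned int mark_slot(unsigned int whichqs, int slot)
{
    assert(slot >= 0 && slot < NQS);
    return whichqs | (1u << slot);
}

/* Mark a slot as empty. */
unsigned int clear_slot(unsigned int whichqs, int slot)
{
    assert(slot >= 0 && slot < NQS);
    return whichqs & ~(1u << slot);
}

/* Index of the first occupied slot, or -1 if no process is queued.
 * Assumption: lower indices are examined first. */
int first_ready_slot(unsigned int whichqs)
{
    for (int i = 0; i < NQS; i++)
        if (whichqs & (1u << i))
            return i;
    return -1;
}
```

An empty mask tells an engine at once that the run queue has no work, which keeps the check cheap when an engine schedules from several run queues.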



4. Runq list and engine list data structures 

struct runql { 
    struct runq *rql_runq; 
    struct runql *rql_next; 
}; 

struct engl { 
    struct engine *el_eng; 
    struct engl *el_next; 
}; 



The struct runql and struct engl data structures define 
circularly linked lists of run queue and engine members. 
The struct runql statement defines the circular list of 
run queues to which an engine belongs, and is pointed to 
from the *e_rql field of the engine data structure. The 
struct engl statement defines the circular list of engines 
that belong to a run queue, and is pointed to from the 
*r_engs field of the run queue data structure. The lists 
are organized circularly to allow implementation of a 
conventional round robin scheduling routine. To avoid 
scheduling inequalities, the scheduling code sets its list 
pointer (*e_rql for engines; *r_engs for run queues) to 
the last entry operated upon, and starts looping one 
entry beyond the last entry. Because the lists are 
circular, the implementation requires only the assignment of 
the list pointer. 
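
The round robin step described above can be sketched in C as follows. This is an illustration rather than the patented code: next_engine is a hypothetical helper, e_id is a stand-in field used only to tell engines apart, and the caller is assumed to hold the list pointer (for example, the r_engs field) in the variable passed as cursor.

```c
#include <assert.h>

struct engine {
    int e_id;                  /* hypothetical identifier for this sketch */
};

struct engl {
    struct engine *el_eng;
    struct engl *el_next;      /* circular: the last entry points to the first */
};

/* Pick the next engine in round robin order.
 * *cursor points at the entry operated upon last time. The scan starts
 * one entry beyond it, and the cursor is left on the entry just chosen,
 * so advancing requires only one pointer assignment. */
struct engine *next_engine(struct engl **cursor)
{
    assert(cursor != 0 && *cursor != 0);
    *cursor = (*cursor)->el_next;
    return (*cursor)->el_eng;
}
```

Because the cursor always rests on the most recently used entry, no entry is favored from one scheduling pass to the next.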

FIGS. 7, 8, and 9 illustrate relationships among various 
data structures according to the present invention. 
FIG. 7 illustrates only a single engine list and a single 
run queue list, whereas there are actually multiple lists 
of engines and multiple lists of run queues. FIG. 8 
illustrates relationships among data structures of the present 
invention from the perspective of the process data 
structure, for a single process. FIG. 9 illustrates 
relationships among data structures of the present invention 
from the perspective of the engine data structure, for a 
single engine. 

Pseudo-code for algorithms 

The following is a description of the algorithms that 
govern the operation and relationship of processes, run 
queues, engines, and lists according to the present in- 
vention. The algorithms are expressed in pseudo-code 
format below. Multi-processor system 50 may use any 
conventional means to perform the functions of the 
algorithms, which are described below in detail. 

1. Set process runnable (setrun/setrq) 

The following algorithm is called setrun/setrq: 
    If process allows cache affinity 
        calc_affinity. 
    Insert process in r_qs, update r_whichqs. 
    Find lowest priority engine in run queue. 
    If it is lower than the process 
        Nudge the engine. 

When it is newly created, or has awakened from a 
sleeping state, a process is set runnable by the above 
setrun/setrq algorithm. The run queue in which the 
process will be placed is a function of whether it ever 
ran before, where it may have run before, and the 
calculated amount of cache context for the process. The scan 
for the lowest-priority engine traverses the *r_engs list 
from the run queue data structure. If the process is 
attached to an affinity run queue, then the lowest 
priority engine is the only engine associated with that 
affinity run queue. 
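
The last two steps of setrun/setrq (find the lowest priority engine, then nudge it if the new process outranks it) can be sketched in C. The sketch assumes that a larger number means a higher priority, which the description does not specify. The nudged field is a hypothetical stand-in for whatever mechanism actually interrupts the engine, and both function names are introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct engine {
    int e_npri;      /* priority of current process */
    int nudged;      /* hypothetical flag standing in for the real nudge */
};

struct engl {
    struct engine *el_eng;
    struct engl *el_next;    /* circular list of engines on a run queue */
};

/* Walk the circular r_engs list once and return the engine whose
 * current process has the lowest priority (larger = higher, by assumption). */
struct engine *lowest_priority_engine(struct engl *head)
{
    struct engine *lowest = NULL;
    struct engl *l = head;

    if (head == NULL)
        return NULL;
    do {
        if (lowest == NULL || l->el_eng->e_npri < lowest->e_npri)
            lowest = l->el_eng;
        l = l->el_next;
    } while (l != head);
    return lowest;
}

/* Nudge the lowest priority engine if the new process outranks it.
 * Returns 1 if an engine was nudged, 0 otherwise. */
int nudge_if_preemptible(struct engl *head, int proc_pri)
{
    struct engine *e = lowest_priority_engine(head);

    if (e != NULL && e->e_npri < proc_pri) {
        e->nudged = 1;
        return 1;
    }
    return 0;
}
```

For an affinity run queue the list holds a single engine, so the scan reduces to one comparison against that engine.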

2. Calculate cache affinity (calc_affinity) 

The calculation of cache affinity (i.e., the calc_affinity 
step of the setrun/setrq algorithm) of a newly runnable 
process is described in the pseudo-code routine below. 
The following algorithm is called calc_affinity: 

    If process never ran 
        return 
    If process is currently on affinity run queue 
        If no cache warmth or shutdown 
            Leave the affinity run queue. 
    Else 
        If cache warmth and not shutdown 
            Join the affinity run queue. 

This pseudo-code represents the basic process for 
utilizing cache affinity: if the process has cache warmth, 
attach the process to the affinity run queue; if the cache 
is cold, attach the process to its home run queue. The 
exact number of clock ticks for "cold" and "warm" 
cache values are patchable parameters, to allow 
implementation specific applications. A value of D_{e-p} less than 
a lower limit L_{lower} indicates a warm cache. A value of 
D_{e-p} greater than an upper limit L_{upper} indicates a cold 
cache. In a preferred embodiment, L_{lower} = 3 and 
L_{upper} = 15. 
If the value of D_{e-p} is between L_{lower} and L_{upper}, the 
scheduler considers the current state of run queue 
pointer *p_rq in deciding in which run queue to 
enqueue a process. As described above, if a process is 
affinitied to an engine, *p_rq points to the affinity run queue 
of that engine. If the process is not affinitied, then 
*p_rq points to the home run queue, which is typically 
global run queue 54. In essence, the algorithm states, if 
a process is not affinitied, then the run queue to which 
*p_rq points will not change unless the cache context is 
high. If a process is affinitied, then the run queue to 
which *p_rq points will not change unless the cache 
context is low. The gap between L_{lower} and L_{upper} thus 
builds hysteresis into the run queue switching algorithm 
and prevents pointer oscillations. 

The hysteresis scheme is summarized in the table, 
below: 





                                                Run queue process 
Condition                         *p_rq         enqueued to 

D_{e-p} < L_{lower}               global        affinity 
                                  affinity      affinity 

L_{lower} <= D_{e-p} <= L_{upper} global        global 
                                  affinity      affinity 

D_{e-p} > L_{upper}               global        global 
                                  affinity      global 
As can be seen from the table, if D_{e-p} < L_{lower}, the 
process is enqueued to the affinity run queue of the 
engine (*p_runeng) on which the process last ran, 
regardless of whether *p_rq points to the affinity run 
queue or the global run queue. If L_{lower} <= D_{e-p} <= L_{upper}, 
then the process is enqueued to whatever run queue 
*p_rq points to. If D_{e-p} > L_{upper}, then the process is 
enqueued to the global run queue, regardless of whether 
*p_rq previously pointed to the affinity run queue or 
the global run queue. In the case where D_{e-p} > L_{upper}, if 
*p_rq previously pointed to the affinity run queue, 
*p_rq is changed to point to the global run queue. 
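
The hysteresis rule can be condensed into a short C sketch. The thresholds are the values given for the preferred embodiment; they appear here as compile-time constants for brevity, although the description treats them as patchable parameters. The enum rq_kind and the function choose_run_queue are hypothetical names standing in for the run queue that *p_rq points to.

```c
#include <assert.h>

#define L_LOWER 3     /* below this many ticks the cache is considered warm */
#define L_UPPER 15    /* above this many ticks the cache is considered cold */

/* Hypothetical stand-in for the kind of run queue *p_rq points to. */
enum rq_kind { RQ_GLOBAL, RQ_AFFINITY };

/* Decide where a newly runnable process is enqueued.
 * current is the queue *p_rq points to now; d_ep is D_{e-p}. */
enum rq_kind choose_run_queue(enum rq_kind current, unsigned long d_ep)
{
    if (d_ep < L_LOWER)
        return RQ_AFFINITY;   /* warm cache: join or stay on the affinity queue */
    if (d_ep > L_UPPER)
        return RQ_GLOBAL;     /* cold cache: return to the home (global) queue */
    return current;           /* in between: keep whatever *p_rq points to */
}
```

The middle band is what prevents oscillation: a process whose D_{e-p} hovers around a single threshold would otherwise switch run queues every time it became runnable.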

The hysteresis scheme is illustrated by the following 
example. At time t_0, both *p_rq (current run queue 
pointer) and *p_rqhome (home run queue) of process X 
point to global run queue 54. At time t_1, process X is run 
by engine 1, and *p_rq points to engine 1. At time t_2, 
engine 1 stops running process X, which then goes to 
sleep. At time t_2, the count of the counter of engine 
number 1 is C_1, which is stored in memory as p_runtime. 
At time t_3, process X becomes runnable, and at 
time t_4, the scheduler decides whether to enqueue 
runnable process X to the affinity run queue 58 of engine 1 or 
to global run queue 54. At time t_4, the count of the 
counter of engine 1 is C_1 + 5, which is stored as e_runtime. 
Therefore, D_{e-p} = (C_1 + 5) - C_1 = 5. As noted 
above, in a preferred embodiment, L_{lower} = 3 and 
L_{upper} = 15. Because L_{lower} <= 5 <= L_{upper}, *p_rq 
continues to point to the global run queue. 

Continuing the example, at time t_5, process X is run 
by, for example, engine 5. Therefore, *p_rq points to 
engine 5. At time t_6, engine 5 stops running process X, 
which then goes to sleep. At time t_6, the count of the 
counter of engine number 5 is C_2, which is stored in 
memory as p_runtime. At time t_7, process X becomes 
runnable, and at time t_8, the scheduler decides whether 
to enqueue runnable process X to the affinity run queue 
58 of engine 5 or to global run queue 54. At time t_8, the 
count of the counter of engine 5 is C_2 + 2, which is 
stored as e_runtime. Therefore, D_{e-p} = (C_2 + 2) - C_2 = 2. 
Because D_{e-p} = 2 < L_{lower}, process X is enqueued to the 
affinity queue 58 of engine 5. 

Then, the next time process X becomes runnable, the 
scheduler will decide whether process X should be 
enqueued to the affinity queue 58 of engine 5 or global 
run queue 54. However, because process X is affinitied 
to engine 5, D_{e-p} must be greater than L_{upper} = 15 in 
order for process X to be enqueued to global queue 54. 

In a preferred embodiment, the calculation of D_{e-p} is 
made before the decision of which run queue to place 
the process in. There is some latency between the 
time that this decision is made and the time that the 
process is actually run on an engine. The cache context 
for that process may have eroded during the latency 
time. Therefore, the latency time should be considered 
in choosing the values for L_{lower} and L_{upper}. The 
relatively low values assigned to L_{lower} and L_{upper} in the 
preferred embodiment compensate somewhat for cache 
context "cooling" during the latency time. 

3. Switch to a new process (swtch) 

The following algorithm is called swtch: 
    Record current engine, engine time in proc 
    Loop: 
        If shutdown 
            Shut down. 
        Find highest priority process in run queue. 
        If found something to do 
            Run it. 
        idle (returns runq) 
        If idle found something 
            Run it. 
    End loop. 

The first step updates the p_runtime and *p_runeng 
fields. These are then used by the setrq() algorithm to 
implement cache affinity. The loop also calls a function 
to find the run queue containing the highest priority 
process under the engine's list. If a run queue with a 
runnable process is found, the next process from that 
run queue is dequeued and run. If a runnable process is 
not found, a subroutine, idle(), is called to implement 
idleness. idle() also returns a run queue, as idle() had this 
information available to it. The loop then takes the next 
process from this run queue, and runs it. idle() can also 
detect that the idling engine has been requested to shut 
down. In this case, idle() returns without a run queue. This 
causes swtch to go back to the top of its loop, detect 
that the engine has been requested to shut down, and 
shut itself down. 

4. Priority and load balance algorithms 

In a multi-engine, multi-run queue system, there is a 
need to periodically assess and adjust the priority of 
processes and the engines on which those processes are 
queued to run. The three major algorithms are: (1) 
timeslice(), which timeshares processes of equal priority 
within a particular run queue slot; (2) schedcpu(), which 
periodically adjusts the priority of processes assigned to 
different run queue slots in the same run queue; and (3) 
load_balance(), which periodically moves processes 
from one run queue to another to average the amount of 
work to be performed by each engine. The pseudo-code 
for these algorithms is given below. 
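
The load balancing scheme named in this section can be sketched in C using the sampling rules the description gives further on: maxnrun rises by 1/8 per sample while the observed load exceeds it, and load_balance compares the highest and lowest values before clearing them. Several details here are assumptions made for illustration: maxnrun is held as an integer count of eighths, "two less" is read as a gap of at least two processes, and the function reports which engines to move work between instead of moving processes itself. The struct load and its field names are hypothetical.

```c
#include <assert.h>

#define NENG 4        /* hypothetical number of on-line engines */
#define EIGHTHS 8     /* maxnrun is stored in units of 1/8 process */

struct load {
    int queued;       /* processes waiting on this engine's run queue */
    int running;      /* nonzero if the engine is running a process */
    int maxnrun8;     /* filtered load estimate, in eighths */
};

/* One sampling pass (the description samples 10 times per second). */
void runq_maxnrun(struct load eng[], int n)
{
    for (int i = 0; i < n; i++) {
        int count = eng[i].queued + (eng[i].running ? 1 : 0);
        if (count * EIGHTHS > eng[i].maxnrun8)
            eng[i].maxnrun8 += 1;            /* add 1/8 to maxnrun */
    }
}

/* One balancing pass (the description runs it every five seconds).
 * Returns 1 and sets *from and *to when an imbalance is found.
 * Strict comparisons keep the first engine found when values tie. */
int load_balance(struct load eng[], int n, int *from, int *to)
{
    int lo = 0, hi = 0, found = 0;

    assert(n > 0);
    for (int i = 1; i < n; i++) {
        if (eng[i].maxnrun8 < eng[lo].maxnrun8)
            lo = i;
        if (eng[i].maxnrun8 > eng[hi].maxnrun8)
            hi = i;
    }
    if (eng[hi].maxnrun8 - eng[lo].maxnrun8 >= 2 * EIGHTHS) {
        *from = hi;
        *to = lo;
        found = 1;
    }
    for (int i = 0; i < n; i++)
        eng[i].maxnrun8 = 0;                 /* clear for the next interval */
    return found;
}
```

Raising the estimate by only 1/8 per sample means a brief burst of queued processes cannot push maxnrun to its peak; the load has to persist across many samples before it registers in full.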



a. Cause timeslicing (timeslice) 

The following algorithm is called timeslice: 
    For each active run queue 
        [...] priority [...] engine. 
        Nudge engine. 

The timeslicing algorithm is unchanged from the one 
in the conventional UNIX scheduler. What has changed 
is the way in which the algorithm is applied to the 
affinity run queues, global run queue 54, and the FPA run 
queue. In the system code of prior art system 10, both 
the FPA and affinity run queues were special cases. In 
this invention, the same data structure (struct runq) is 
used for all types of run queues, thereby making it 
possible to generalize the code and run the same algorithm 
across each run queue on the active list. There is a 
possible problem when an engine is a member of more 
than one run queue. However, the e_npri field 
correctly indicates what priority is pending on that engine 
(via nudge), so each successive run queue, as it is 
timesliced, will treat the engine accordingly. 

b. Change process priority (schedcpu) 

The schedcpu() algorithm is unchanged from the one 
used in the conventional UNIX schedulers. 

c. Cause load balancing (load_balance()) 

The following algorithm is called load_balance: 
    Scan all on-line engines 
        Record engine with lowest maxnrun value. 
        Record engine with highest maxnrun value. 
    If lowest is two less than highest 
        [...] 
    Clear maxnrun value for all engines. 

d. Calculate maxnrun 

The following algorithm is called runq_maxnrun: 
    For each on-line engine 
        Count how many processes are queued 
        If engine is running add one to this count 
        If count is greater than maxnrun for this engine 
            Add 1/8 to maxnrun for this engine. 

The maxnrun routine collects information about the 
overall process load of the various engines. The value of 
maxnrun is used in the load balance algorithm. By 
sampling the number of processes queued to each engine, 
the routine approximates how many processes are 
competing for each engine. By incrementing the maxnrun 
value in 1/8th count intervals, the routine filters the value 
and prevents utilization spikes from adversely skewing 
the maxnrun value. maxnrun is sampled 10 times per 
second. The runq_maxnrun algorithm could delete the 
step of adding one to the count if an engine is running. 

The load_balance routine runs every five seconds, 
which corresponds to 50 samples of the maxnrun value 
for each engine. The load_balance routine identifies 
significant load imbalances and then causes a fraction of 
the processes to move from the most loaded engine to 
the least loaded engine. Maxnrun is cleared after each 
sampling interval. When there are ties among the most 
loaded or least loaded engines, the load_balance 
routine operates on the first tie found. Round robin 
scanning is used to minimize "favoritism" on the part of the 
load balancing routine. 

The advantage of the multi-run queue algorithm is 
that the scheduling code paths continue to be as short as 
they were in the prior art. The cache affinity decision is 
made once at the time the process becomes runnable, 
and there is no need for the process to be 
considered repeatedly as each engine looks for work. 
Because there are separate run queues, there is also the 
possibility of moving the run queue interlock into the 
run queue itself, thus allowing multiple engines to 
schedule and dispatch their [...] 

Alternative embodiments of the invention 

It will be obvious to those having skill in the art that 
many changes may be made to the details of the 
above-described embodiment of this invention without 
departing from the underlying principles thereof. 

For example, the invention is also applicable to 
multi-processor computing systems other than those using 
UNIX operating systems. 

System 50 could include more than one type of 
engine, but only one global run queue. The global run 
queue could queue processes of more than one type, 
such as, for example, 386 type processes and FPA type 
processes. 

The calculation of cache context can be made closer 
to the time that a process is actually run, for example, 
when there is only one process in front of the process in 
question. 

The estimation of cache context may consider kernel 
processes as well as user processes. For example, the 
counter of an engine could be incremented when the 
hardclock routine is entered while an engine is running 
a kernel process. 

In the preferred embodiment described above, the 
scheduler considers only the value of D_{e-p} and whether 
the process is affinitied or unaffinitied in deciding 
whether a process should be enqueued to an affinity run 
queue or a global run queue. Alternatively, the 
scheduler could consider other factors such as the number 
and/or priority of processes in the affinity run queue. 
The scheduler could also consider how many processes 
there are in other run queues and how much data are 
expected to pass over system bus 12 in the next short 
time period. 

In the preferred embodiment, there is one cache 
memory for each engine. Alternatively, an engine could 
have more than one cache memory or share a cache 
memory with one or more other engines. 

In the preferred embodiment, the same data structure 
is used for each run queue. Alternatively, different data 
structures could be used for different types of run 
queues. 

The preferred embodiment employs load_balance. 
Alternatively or in addition, an engine could "steal" 
work as follows. An idle engine could scan the affinity 
run queues of other engines for waiting processes that 
could be run on the idle engine. This could have the 
effect of the load_balance algorithm, but might achieve 
this effect with less latency. This technique could defeat 
the effects of cache affinity by causing many more 
engine-to-engine process switches than would occur 
otherwise. The cases where this technique can be used 
effectively are thus determined by the speed of the 
engines, the size of the caches, and the CPU utilization 
characteristics of the processes being scheduled. 

The scope of the present invention should be 
determined only by the following claims. 

I claim: 

1. A computing system, comprising: 
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multiple computing engines that run processes, the 
multiple computing engines being associated with 
respective cache memories and respective affinity 
run queues; 

cache context estimating means for estimating an 
amount of cache context of a particular one of the 
cache memories with respect to a particular one of 
the processes; 

enqueuing means for enqueuing certain ones of the 
processes to the affinity run queues; and 

decision means responsive to the estimated amount of 
cache context for deciding whether to enqueue the 
particular process to a particular one of the affinity 
run queues. 

2. The system of claim 1 in which the certain ones of 
the processes that are enqueued to the affinity run 
queues comprise a first set of processes, and the enqueu- 
ing means also enqueues a second set of processes to at 
least one global run queue, where some of the processes 
are included in both the first and second sets of 
processes, and in which the decision means decides 
whether the particular process is to be enqueued to the 
particular affinity run queue or to one of the global run 
queues. 

3. The system of claim 2 in which each one of the 
affinity run queues and each one of the global run 
queues comprises an array of slots arranged in priority, 
each slot being capable of queuing a linked list of 
processes. 

4. The system of claim 2 in which one of the affinity 
run queues and one of the global run queues have sub- 
stantially identical data structure. 

5. The system of claim 2 in which there are at least 
two global run queues and the processes include 
processes of a particular type, and one of the global run 
queues may enqueue processes of the particular type 
and another one of the global run queues may not. 

6. The system of claim 1 in which the decision means 
considers whether the particular process is affinitied or 
unaffinitied in deciding whether to enqueue the 
particular process to the particular affinity run queue. 

7. The system of claim 1 in which the cache context 
estimating means includes engine activity measuring 
means for counting engine activity time occurring from 

a time when the particular process leaves the particular 
computing engine to a later time. 

8. The system of claim 7 in which the engine activity 
measuring means includes a counter that counts units of 
engine activity. 

9. The system of claim 7 in which the later time is a 
time at which the decision means decides whether the 
particular process is to be queued to the particular affin- 
ity run queue. 

10. The system of claim 7 in which the processes 
include user processes and kernel processes, and the 
engine activity measuring means counts engine activity 
time occurring during user processes, not during kernel 
processes. 

11. The system of claim 1 in which the decision means 

is responsive to a number of processes queued to the 
particular affinity queue in deciding whether the partic- 
ular process is to be queued to the particular affinity run 
queue. 

12. The system of claim 1 in which the decision means 

is responsive to a priority of the particular processes 
queued to the particular affinity queue in deciding 
whether the particular process is to be queued to the 
particular affinity run queue. 
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13. The system of claim 1 in which the decision means 
is responsive to a number of processes enqueued to the 
affinity run queue of the particular computing engine 
and a number of processes enqueued to the affinity run 
queues of other ones of the computing engines. 

14. The system of claim 1 in which the decision means 
is responsive to a number of the processes that are 
queued to ones of the affinity run queues other than the 
particular affinity run queue in deciding whether the 
particular process is to be queued to the particular affin- 
ity run queue. 

15. The system of claim 1 in which the decision means 
is responsive to an anticipated amount of data passing 
on a system bus in deciding whether the particular pro- 
cess is to be queued to the particular affinity run queue. 

16. The system of claim 1 in which the decision means 
decides whether the particular process is to be queued 
to the particular affinity run queue in response to the 
particular process being located in a predetermined 
section of the affinity run queue. 

17. A computing system, comprising: 
multiple computing engines that run processes, the 

multiple computing engines each being associated 
with a cache memory and an affinity run queue; 
cache context estimating means for estimating with 
respect to a particular one of the processes an 
amount of cache context of the cache memory 
associated with a particular one of the multiple 
computing engines, which ran the particular pro- 
cess; 

enqueuing means for enqueuing certain ones of the 

processes to the affinity run queues; and 
decision means responsive to the amount of cache 
context for deciding whether to enqueue the par- 
ticular process to the affinity run queue associated 
with the particular computing engine. 

18. A computing system, comprising: 
multiple computing engines that run processes, the 

multiple computing engines each being associated 
with a respective cache memory, a respective affin- 
ity run queue, and at least one global run queue; 
cache context estimating means for estimating an 
amount of cache context of a particular one of the 
cache memories with respect to a particular one of 
the processes; 
enqueuing means for enqueuing certain ones of the 

processes to the affinity run queues; and 
decision means responsive to the amount of cache 
context for deciding whether to enqueue the par- 
ticular process to the affinity run queue associated 
with the particular computing engine or to one of 
the global run queues. 

19. A computing system, comprising: 
multiple computing engines that run processes in- 
cluding unaffinitied processes and affinitied pro- 
cesses, the multiple computing engines being asso- 
ciated with respective cache memories, respective 
affinity run queues, and at least one global run 
queue; 

cache context estimating means for estimating an 
amount of cache context of a particular one of the 
cache memories with respect to a particular one of 
the processes; 
enqueuing means for enqueuing certain ones of the 

processes to the affinity run queues; and 
decision means responsive to the estimated amount of 
cache context for deciding whether a particular 
unaffinitied process should become affinitied to a 
particular one of the computing engines and 
enqueued to the affinity run queue of the particular 
computing engine, and for deciding whether a 
particular affinitied process should become 
unaffinitied and enqueued to one of the global run 
queues. 

20. A computing system, comprising: 

multiple computing engines that run processes and 
are associated with respective affinity run queues, 
respective particular numbers of the processes 
being associated with each computing engine, the 
respective particular numbers being at least zero; 

storage means for storing multiple variables each 
having a value and each respectively associated 
with the multiple computing engines, each respec- 
tive particular number corresponding to one of the 
variables; 

sampling means for repeatedly sampling the 
respective particular numbers of the processes associated 
with each computing engine; 

determining means for determining which of the re- 
spective particular numbers of processes are 
greater than the corresponding variable values; 

increasing means for increasing each one of the 
variable values for which a corresponding respective 
particular number is determined to be greater; and 

moving means for transferring certain ones of the 
processes from one of the affinity run queues 
associated with one of the computing engines 
associated with a highest variable value to one of the 
affinity run queues associated with one of the com- 
puting engines associated with a lowest variable 
value. 

21. The system of claim 20 in which each one of the 
respective particular number of the processes is an inte- 
ger equal to a number of the processes in the affinity run 
queue of the respective computing engine plus one if 
one of the processes is being run by the respective 
computing engine when a sample is made. 

22. A method for assigning processes to run queues in 
a multi-engine computing system, the method compris- 
ing the steps of: 

queuing a process to a global run queue; 
running the process on a first computing engine 
which is associated with a first cache memory; 
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storing the process in a memory; 

storing a first count of a counter associated with the 

first computing engine at a time that the process is 

stored; 

storing a second count of the counter at a time that 
the process becomes runnable; 

comparing the first and second counts to estimate the 
amount of cache context remaining in the first 
cache memory with respect to the process; and 

deciding whether to enqueue the process to an affin- 
ity run queue associated with the first computing 
engine or to another run queue based on the esti- 
mated amount of cache context. 

23. The system of claim 21 further comprising prior- 
ity considering means for considering a priority of one 
of the processes in determining whether the process 
should be transferred. 

24. A computing system, comprising: 

multiple computing engines that run processes and 
are associated with respective affinity run queues, 
respective particular numbers of the processes 
being associated with each computing engine, the 
respective particular numbers being at least zero; 

storage means for storing multiple variables each 
having a value and each respectively associated 
with one of the multiple computing engines, each 
respective particular number corresponding to one 
of the variables; 

determining means for determining which of the re- 
spective particular numbers of processes are 
greater than the corresponding variable values; 

increasing means for increasing each one of the vari- 
able values for which a corresponding respective 
particular number is determined to be greater; and 

moving means for transferring certain ones of the 
processes from one of the affinity run queues asso- 
ciated with one of the computing engines having a 
particular variable value to another one of the 
affinity run queues associated with one of the com- 
puting engines associated with a lower variable 
value. 

25. The system of claim 24 further comprising prior- 
ity considering means for considering a priority of one 
of the processes in determining whether the process
should be transferred.
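
By way of illustration only, the sampling of claim 21 and the determining, increasing, and moving means of claim 24 might be sketched as follows. The function names, the step size of 1, and the choice of the highest-valued and lowest-valued engines as source and destination are assumptions made for the sketch; claim 24 requires only that the destination engine have a lower variable value than the source.

```python
from typing import List, Optional, Tuple


def sample_load(queue_length: int, running: bool) -> int:
    """Number of processes associated with an engine at sample time:
    the affinity run queue length, plus one if a process is being run."""
    return queue_length + (1 if running else 0)


def update_variables(counts: List[int], variables: List[int], step: int = 1) -> List[int]:
    """Increase each variable whose corresponding process count exceeds it.
    The step size is an assumed value."""
    return [v + step if c > v else v for c, v in zip(counts, variables)]


def pick_transfer(variables: List[int]) -> Optional[Tuple[int, int]]:
    """Return (source, destination) engine indexes for a transfer, moving
    from the highest variable value to the lowest, or None when all
    variable values are equal. Ties go to the lowest engine index."""
    src = max(range(len(variables)), key=variables.__getitem__)
    dst = min(range(len(variables)), key=variables.__getitem__)
    if variables[src] == variables[dst]:
        return None
    return src, dst
```

For three engines whose sampled counts are 4, 0, and 2 and whose variables all start at 1, the variables become 2, 1, and 2, and the sketch selects a transfer from the first engine to the second.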

***** 
UNITED STATES PATENT AND TRADEMARK OFFICE

CERTIFICATE OF CORRECTION 

PATENT NO. : 5,185,861 Page 1 of 2

DATED : February 9, 1993 
INVENTOR(S) : Andrew J. Valencia 

It is certified that error appears in the above-identified patent and that said Letters Patent is hereby 
corrected as shown below: 



Column 1, line 6, change "multiprocessor" to
-- multi-processor --.

Column 1, line 64, change "m=n" to -- m = n --.
Column 2, line 3, change "4" to -- 34 --.

Column 3, line 10, change "realworld" to -- real-world --.
Column 3, line 55, change "12 12" to -- 12 p --.

Column 8, line 13, change "at time t_D e" to
-- at time t_(e-p) --.

Column 9, line 27, change "(e.g., less" to -- (e.g. less --.

Column 12, line 67, after "I-i ower" insert -- = 3 --.

Column 15, line 47, change "load balance" to
-- load balance --.

Column 17, line 65, change "processed" to -- processes --.
Signed and Sealed this
Eleventh Day of January, 1994 



BRUCE LEHMAN 

Attesting Officer Commissioner of Patents and Trademarks 